COHERENS workshop #2

Florian Ricour (ECOMOD)
Ludovic Lepers (MFC)

Part I

JupyterHub on ECMWF

A clear procedure to get there

  • Need an ECMWF account
  • Access to local terminal
    1. tsh login --proxy=jump.ecmwf.int
    2. ssh -X hpc-login
  • Tutorial to connect to JupyterHub
  • Multiple server options (Profile, CPU number, session duration, …)

Let’s try JupyterHub!

File System Features
File System   Features                            Quota
HOME          Backed up                           10 GB
PERM          No back up                          500 GB
HPCPERM       No back up                          100 GB*/1 TB
SCRATCH       No back up                          50 TB*/2 TB
TMPDIR        Deleted at the end of session/job   3 GB by default
* for users without HPC access such as ECS

Conclusion on Part I

  • GUI is more user-friendly than a cold heartless terminal
  • Take advantage of the HPC filesystem at your disposal
  • Recent service (April 2024), be on top of the game!
  • Sessions last up to 7 days; just go to the URL and you’re back online
  • Use it for tasks other than model simulations
  • Save time on heavy downloads/uploads from/to Sharepoint (Erk)
  • Preserve your laptop and run big Python/R codes on ECMWF
  • You want to go home but need to wait for a script to finish? \(\rightarrow\) ECMWF
  • I like it and I hope you’ll use it!

Short break

Part II

Artificial Intelligence, how smart is it?

Who hasn’t used ChatGPT here?

  • chatGPT (you know it)
  • Claude.ai ($20/month well spent)
  • GitHub Copilot (code completion)
  • Perplexity (search engine)
  • NotebookLM (nice podcasts)
  • DeepL (translation)
  • Many more

Artificial intelligence

AI is a vague terminology

Artificial Neural Network (ANN)

A simple ANN - The Perceptron

Based on an artificial neuron called threshold logic unit (TLU)

\[ \text{heaviside}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases} \]

Perceptron composed of one or more TLUs

Every TLU connected to every input = fully connected layer or dense layer

  • \(h_{\mathbf{W},\mathbf{b}}(\mathbf{X}) = \phi(\mathbf{X}\mathbf{W} + \mathbf{b})\)
  • \(\mathbf{b} = \text{bias vector}, \text{one value per neuron}\)
  • \(\phi = \text{activation function}\)

\(\rightarrow\) backpropagation algorithm (see after)
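The dense-layer formula above can be sketched in a few lines of NumPy; the weights below are hand-picked so a single TLU acts as an AND gate (an illustrative choice, not a trained model):

```python
import numpy as np

def heaviside(z):
    """Threshold activation: 0 if z < 0, else 1."""
    return (z >= 0).astype(int)

def dense_layer(X, W, b, phi=heaviside):
    """Fully connected layer: h_{W,b}(X) = phi(XW + b)."""
    return phi(X @ W + b)

# 2 inputs -> 1 TLU acting as a logical AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = np.array([[1.0], [1.0]])   # one weight per input
b = np.array([-1.5])           # fires only when both inputs are 1

print(dense_layer(X, W, b).ravel())  # [0 0 0 1]
```

In practice the weights and biases are learned by backpropagation rather than set by hand.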

Multilayer Perceptron - XOR example

A B A XOR B
0 0 0
0 1 1
1 0 1
1 1 0
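A single TLU cannot compute XOR, but stacking two layers can. A minimal sketch with hand-picked weights (illustrative, not trained): one hidden unit acts as OR, another as AND, and the output fires when OR is true but AND is not.

```python
import numpy as np

def step(z):
    """Heaviside step function."""
    return (z >= 0).astype(int)

def xor_mlp(x1, x2):
    """Two-layer perceptron computing XOR with hand-picked weights."""
    x = np.array([x1, x2])
    h_or  = step(x.sum() - 0.5)   # hidden unit 1: OR gate
    h_and = step(x.sum() - 1.5)   # hidden unit 2: AND gate
    return int(step(h_or - h_and - 0.5))  # OR and not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))  # reproduces the truth table above
```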

ANN with deep stack of hidden layers = deep neural network

Tweaking parameters to minimize the cost

Famous optimization algorithm - gradient descent (iterative process)

  • \(\boldsymbol{\theta} = \text{parameter vector}\)
  • \(\eta = \text{learning step} = \text{learning rate}\)
  • \(\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \text{Cost}(\boldsymbol{\theta})\)
  • \(\text{If Cost}(\boldsymbol{\theta}) = \text{RMSE}(\boldsymbol{\theta})\)
    • \(\min_{\boldsymbol{\theta}} \text{Cost}(\boldsymbol{\theta}) = \min_{\boldsymbol{\theta}} \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}\)
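The iterative update above can be sketched for a one-parameter linear model \(\hat{y} = \theta x\); the data and learning rate are illustrative (minimising the MSE also minimises the RMSE, so the gradient of the MSE is used for simplicity):

```python
import numpy as np

# Toy data generated from y = 3x, so the optimum is theta = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

theta = 0.0   # parameter (here a single scalar)
eta = 0.01    # learning rate

for _ in range(500):
    y_hat = theta * x
    # gradient of the MSE cost with respect to theta
    grad = -2.0 / len(x) * np.sum((y - y_hat) * x)
    theta = theta - eta * grad   # theta^(next step) = theta - eta * gradient

print(round(theta, 3))  # converges to 3.0
```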

Have you ever seen an ANN in action?

TensorFlow Playground (👉 try it!)

Now that we know you’ve all used chatGPT

  • chatGPT (you know it)
  • Claude.ai ($20/month well spent)
  • GitHub Copilot (code completion)
  • Perplexity (search engine)
  • NotebookLM (nice podcasts)
  • DeepL (translation)
  • Many more

Large Language Models (LLMs)

Part III

LLMs, what’s all the fuss about?

LLMs are huge neural networks

  • Billions of parameters (e.g. GPT-3 - 175 billion)
  • Specialized in language processing
  • The most famous ones (e.g. GPT-3) are proprietary (i.e. unknown weights)
  • Some models have open weights, in contrast to being fully open source

Converting text to machine data

  • Tokenization - splitting the text into tokens (👉 try it!)
  • Vectorization - each token receives a vector (e.g. GPT-3 \(\rightarrow\) 12 288 dimensions)
  • Vectors carry semantic meaning (e.g. King - Man + Woman = Queen)
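The analogy arithmetic can be illustrated with a toy 2-D embedding table (hand-made and purely hypothetical; real models learn thousands of dimensions from data):

```python
import numpy as np

# Tiny hypothetical 2-D embeddings; real LLMs learn these from text
emb = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

# King - Man + Woman should land near Queen
target = emb["king"] - emb["man"] + emb["woman"]

def closest(vec, exclude):
    """Return the vocabulary word with highest cosine similarity to vec."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: np.dot(vec, emb[w]) /
                             (np.linalg.norm(vec) * np.linalg.norm(emb[w])))

# As in word2vec-style analogies, the input words are excluded
print(closest(target, exclude={"king", "man", "woman"}))  # queen
```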

Transformers - Attention is all you need

  • GPT - Generative Pretrained Transformers
  • A transformer is a neural network with many layers
  • Context window - a sequence of tokens used as model input
  • The model outputs a token
  • All tokens pass through the neural network and are modified based on the other tokens
  • e.g. in “black dog”, the dog vector will be modified to account for the fact that the dog is black

Trained for probability, not truth

  • The output vector is converted into a probability distribution, from which the next token is selected
  • Each generated token becomes part of the new context window (i.e. continuous text generation)
  • Model training - Tries to predict the next token then backpropagation steps in
  • Trained to be the most probable, not the most accurate
  • Knowledge learned during training (constant, stored in the model parameters)
  • Knowledge from the context window (different at each interaction)
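The first bullet can be sketched with a softmax over a made-up output vector (toy vocabulary and logits, not a real model):

```python
import numpy as np

rng = np.random.default_rng()

vocab = ["dog", "cat", "mat", "the"]       # toy vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0])   # made-up model output vector

# Softmax converts the output vector into a probability distribution
probs = np.exp(logits) / np.exp(logits).sum()

# The next token is sampled from that distribution: the most probable
# token usually wins, but not always - probability, not truth
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```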

Interaction with your new AI companion

  1. There is a hidden system prompt that explains to the model that it must simulate a conversation
  2. Then a sentence is given by the assistant (e.g. Claude ❤️)
  3. A sentence can then be given by the user, the prompt
  4. Repeat 2 and 3 until you are satisfied
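The four steps above can be sketched as a growing message list, the shape most chat APIs use; the system prompt and replies here are illustrative only:

```python
# Schematic of the conversation state a chatbot maintains
messages = [
    # Step 1: hidden system prompt telling the model to play the assistant
    {"role": "system",    "content": "You are a helpful assistant. "
                                     "Simulate a conversation with the user."},
    # Step 2: opening sentence from the assistant
    {"role": "assistant", "content": "Hi! How can I help you today?"},
]

def user_says(text):
    """Step 3: the user's prompt is appended to the context."""
    messages.append({"role": "user", "content": text})

def assistant_replies(text):
    """Step 4: the model's answer joins the context too, and we repeat."""
    messages.append({"role": "assistant", "content": text})

user_says("Explain tokenization in one sentence.")
assistant_replies("Tokenization splits text into small units called tokens.")

# The whole list is fed back to the model at every turn
print(len(messages))  # 4
```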

Transforming a LLM into a chatbot

  • The model is fine-tuned with reinforcement learning from human feedback (RLHF)
  • Human feedback - rating model responses as good or bad (see later Bing Chat)
  • Safety training through feedback reduces harmful outputs but may limit model capabilities
  • Models are optimized to generate responses that appear convincing to humans
  • Models are instructed to simulate assistant-human conversations using embedded system prompts
  • Prompt injection risks (i.e. jailbreak) - when model safety controls are bypassed

Part IV

Examples and best practices

The science of prompting

The less the model has to guess, the better

Non-exhaustive list

Write clear instructions

Help me with my project
Help me write the introduction for a school project about climate change

Split complex tasks into simpler subtasks

Analyze a dataset and create a visualization
Subtask 1: Load and clean the data
Subtask 2: Perform statistical analysis
Subtask 3: Create appropriate visualizations
Subtask 4: Write up insights

Advantage: lower error rates compared to using a single query to perform the whole task.

Give the model time to think with chain of thought reasoning (i.e. step-by-step reasoning)

Problem: If it takes 6 workers 4 days to build a wall, how long would it take 8 workers?
💡 As you solve this problem:
First, state what information is relevant
Then, explain how you’ll approach it
Show each calculation
Check if your answer makes sense
Explain why your answer is reasonable

Note: task decomposition can involve parallel processing, while chain of thought is typically sequential.

Role-based prompts

As a deep-sea biologist: A submersible has discovered massive die-offs of tube worms at 2000 m depth near a hydrothermal vent that was previously thriving. What would be your initial assessment?
You’re a coastal oceanographer working with a beach community: Local swimmers report unusual purple-blue floating organisms appearing in large numbers during the past week. What’s your response?

Advantage: this helps give more specialized answers.

Provide references and/or examples

Upload documents
Provide text and/or code examples
Even screenshots work well

Advantage: it reduces fake answers and matches your text and/or code style.

Tell the model what to do, rather than what not to do

Don’t use complex technical terms when explaining photosynthesis
Explain photosynthesis using everyday language and familiar examples like sunlight helping plants make their food

Context matters - start a new chat if you change topics

💡 Remember the continuous text generation

Going deeper in prompt engineering 🔧

Shown with Claude, but similar rules apply to other models.

Prompt generator and/or improver

Original prompt 😑
From the following list of Wikipedia article titles, identify which article this sentence came from.
Respond with just the article title and nothing else.

Article titles:
{{titles}}

Sentence to classify:
{{sentence}}

Improved prompt 😍

You are an intelligent text classification system specialized in matching sentences to Wikipedia article titles. Your task is to identify which Wikipedia article a given sentence most likely belongs to, based on a provided list of article titles.

First, review the following list of Wikipedia article titles:
<article_titles>
{{titles}}
</article_titles>

Now, consider this sentence that needs to be classified:
<sentence_to_classify>
{{sentence}}
</sentence_to_classify>

Your goal is to determine which article title from the provided list best matches the given sentence. Follow these steps:

1. List the key concepts from the sentence
2. Compare each key concept with the article titles
3. Rank the top 3 most relevant titles and explain why they are relevant
4. Select the most appropriate article title that best encompasses or relates to the sentence's content

Wrap your analysis in <analysis> tags. Include the following:
- List of key concepts from the sentence
- Comparison of each key concept with the article titles
- Ranking of top 3 most relevant titles with explanations
- Your final choice and reasoning

After your analysis, provide your final answer: the single most appropriate Wikipedia article title from the list.

Output only the chosen article title, without any additional text or explanation.

Use XML tags to help the assistant parse your prompts

XML tip
Use tags like <instructions>, <example>, and <formatting> to clearly separate different parts of your prompt. This prevents Claude from mixing up instructions with examples or context.

XML power user tip
Combine XML tags with other techniques like multishot prompting (<examples>) or chain of thought (<thinking>, <answer>). This creates super-structured, high-performance prompts.

XML in practice
You’re a financial analyst at AcmeCorp. Generate a Q2 financial report for our investors.

AcmeCorp is a B2B SaaS company. Our investors value transparency and actionable insights.

Use this data for your report:<data>{{SPREADSHEET_DATA}}</data>

<instructions>
1. Include sections: Revenue Growth, Profit Margins, Cash Flow.
2. Highlight strengths and areas for improvement.
</instructions>

Make your tone concise and professional. Follow this structure:
<formatting_example>{{Q1_REPORT}}</formatting_example>

Going back to role-play prompting
<context>
You are an expert mechanical engineer with 15+ years of experience in automotive design and manufacturing processes. You have:
- Led design teams at major automotive companies
- Deep knowledge of materials science and structural mechanics
- Experience with both conventional and electric vehicle architectures
- Expertise in manufacturing optimization and quality control systems
</context>

<constraints>
- Always explain engineering concepts using precise technical terminology
- Support recommendations with relevant engineering principles
- Consider both theoretical and practical manufacturing limitations
- When discussing specifications, include relevant industry standards
- If unsure about specific details, acknowledge limitations and explain general principles
</constraints>

<tone>
Professional and technical, but able to explain complex concepts clearly to both experts and non-experts.
</tone>

Be proactive and keep on learning

📚 Prompt library
🎓 Anthropic courses
👀 OpenAI and GitHub Copilot prompt engineering best practices
🔎 Practice makes you better

One more thing

Refrain from sharing everything in your prompts 😬

  • Personal or private information (e.g. full names, phone numbers, email addresses)
  • Sensitive information (e.g. financial information)
  • Private medical information
  • Copyrighted or trademarked material (e.g. subscription-only content, licensed software code)
  • Credentials (e.g. an API key, a password left in a code)

Images, audio and notes

Generative AI at its worst and best

Image generation from text or image

Audio generation from text

  • NotebookLM - podcast
    From a paper (Ricour et al., 2023) on carbon sequestration fluxes (the full podcast is 10 min long)
  • Suno, udio - music
    Based on COHERENS's documentation

Getting the info that matters the most

with NotebookLM, powered by Gemini 1.5

  • Upload your sources (PDFs, websites, YouTube videos, Google Docs/Slides, …)
  • NotebookLM will summarize them and make connections (suggest questions) between topics
  • Exact quotes from sources (relies only on uploaded documents)
  • Multimodal - can analyse images and/or plots as well
  • Confidential - data not used to train the model
  • Helpful to summarize papers, anticipate questions, prepare presentations, …

Coding can be fast

Really fast.

It feels like coding with 4 hands

With GitHub Copilot

Autocomplete, chat and command

Codeium is also an alternative to GitHub Copilot

Shiny app built from scratch

Amazingly fast.

New to Shiny? 👉 Try it!

A quick prompt - not strictly following the best practices!

Build an app with:

Dynamic dataset selection between built-in R datasets such as
<datasets>
quakes, iris, faithful, airquality, mtcars, CO2, USArrests and women
</datasets>
Automatic UI updates based on selected data

Multiple visualization options:
- Scatter plots with trend lines
- Box plots
- Violin plots

Interactive features:
- Variable selection for both axes
- Sample size control
- Data preview table
- Summary statistics

Artifacts, a dedicated space for code

  • Documents (markdown or plain text)
  • Code snippets (will not run, except for a few)
  • Websites (HTML pages)
  • SVG images
  • Diagrams (e.g. mermaid flowcharts)
  • Interactive React components

👀 I am not paid by Anthropic to promote Claude

A rapidly evolving ecosystem 🏃

Based on Claude (but the same applies for chatGPT & Co)

  • Claude 3.5 Sonnet (Jun 21, 2024)
  • Artifacts (Aug 27, 2024)
  • Upgrade Claude 3.5 Sonnet (Oct 22, 2024)
  • Analysis tool (Oct 24, 2024) - Analyzing and visualizing data from CSV files
  • Claude 3.5 Sonnet on GitHub Copilot
  • Model Context Protocol (Nov 25, 2024) - Connecting AI to data sources (Web, VScode, GitHub, …)

Next step, AI agents

This is a personal conclusion 💁

  1. Get on the train or be left at the station
  2. See it as an investment if the learning curve scares you
  3. Keep on learning - it’s moving (way too) fast
  4. AI agents are coming (for you), be prepared
  5. Come back to this presentation in 2-3 years

Feedback welcome!